Introduction

We examine the data sets provided by the Human Activity Recognition (HAR) research project (http://groupware.les.inf.puc-rio.br/har) to classify how well a participant performs a specific activity, in this case lifting a dumbbell. This research is interesting because most HAR work classifies which activity a subject performs, whereas this project asks how well the subject performs a known activity. Data is collected from three types of sensors (gyroscope, accelerometer, and magnetometer) placed at four locations: the subject's arm, forearm, and belt, and the dumbbell itself. The training data set (pml-training.csv) contains 19,622 observations from 6 human subjects. Each observation has 157 sensory data columns plus the subject's name, the class describing how well the activity was performed (5 classes), and an index, for 160 columns in total. We are also given a testing set (pml-testing.csv) of 20 observations with the same 160 columns, except that the last column is “problem_id” instead of “classe”. We apply the machine learning model developed in this project to the testing set to predict the performance class of each observation and submit the predictions in 20 separate files.

Overall Project Strategy

We first look at a few representative data points to get a feel for the predictive capability each might have. We remove or ignore sensor columns containing many NAs. We then fit a Random Forest via the caret package to see how well it works. For comparison, we also fit a Generalized Boosted Regression Model (GBM) and a Support Vector Machine (SVM).

We expect the out-of-sample error rate to be below 1%. The best-performing model reached about 99.4% accuracy on the validation partition and 99.1% on the held-out testing partition, and its predictions for the 20 provided test cases were all correct. Detailed error estimates appear in the output of the confusionMatrix function in the caret package for each model investigated.

Exploratory Data Analysis

We first load the data sets:

#Clear working space in RStudio
rm(list = ls(all = TRUE))
#load the caret package
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
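#reading the provided training data set and final testing set
#stringsAsFactors=TRUE keeps classe and user_name as factors (no longer the default since R 4.0)
originalTraining <- read.csv("pml-training.csv",header=TRUE,stringsAsFactors=TRUE)
originalTesting <- read.csv("pml-testing.csv",header=TRUE,stringsAsFactors=TRUE)
#Check number of levels in the factor variable classe
(levels(originalTraining$classe))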
## [1] "A" "B" "C" "D" "E"
#capture the outcome we are to predict in a separate variable
classCol <- originalTraining$classe
#capture number of human subjects, users, in a variable
(users <- levels(originalTraining$user_name))
## [1] "adelmo"   "carlitos" "charles"  "eurico"   "jeremy"   "pedro"

Let’s take a look at a few sensory data points:

library(ggplot2)
ggplot(originalTraining, aes(x=user_name,y=accel_belt_x,color=classe))+geom_point(position=position_jitter(width=.5),alpha=.3)

[Figure: accel_belt_x by user_name, jittered points colored by classe]

ggplot(originalTraining, aes(x=user_name,y=yaw_arm,color=classe))+geom_point(position=position_jitter(width=.5),alpha=.3)

[Figure: yaw_arm by user_name, jittered points colored by classe]

ggplot(originalTraining, aes(x=user_name,y=accel_arm_x,color=classe))+geom_point(position=position_jitter(width=.5),alpha=.3)

[Figure: accel_arm_x by user_name, jittered points colored by classe]

ggplot(originalTraining, aes(x=user_name,y=gyros_arm_x,color=classe))+geom_point(position=position_jitter(width=.5),alpha=.3)

[Figure: gyros_arm_x by user_name, jittered points colored by classe]

We will focus on predictors representing sensors, directions, angles and locations and ignore predictors with lots of NAs.
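
As a rough illustration of how mostly-missing columns could be spotted, the sketch below computes the share of NAs per column on a small made-up data frame and keeps the columns under a cut-off. The toy values, the column kurtosis_belt, and the 50% threshold are assumptions for illustration only; the analysis below selects columns by name instead.

```r
# Toy stand-in for the training data (hypothetical values, not the real file)
toy <- data.frame(
  roll_belt     = c(1.41, 1.41, 1.42, 1.48, 1.45),
  kurtosis_belt = c(NA, NA, NA, NA, -1.2),      # summary statistic, mostly missing
  accel_belt_x  = c(-21, -22, -20, -22, -21),
  classe        = factor(c("A", "A", "B", "B", "C"))
)

# Fraction of missing values in each column
naFraction <- colMeans(is.na(toy))

# Keep columns whose NA share is at or below an assumed cut-off of 50%
naThreshold <- 0.5
keepCols <- names(naFraction)[naFraction <= naThreshold]
toyClean <- toy[, keepCols, drop = FALSE]

print(round(naFraction, 2))
print(keepCols)
```

On the real data the same two lines (colMeans over is.na, then a threshold) would be applied to originalTraining; the appropriate cut-off is a judgment call.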

#Columns used for prediction

sensors <- c("gyros","accel","magnet")
directions <- c("x","y","z")
angles <- c("roll","pitch","yaw")
locations <- c("belt","arm","dumbbell","forearm")
#Isolate all predictors with permutations of sensors, directions, and locations
XYZs <- sort( apply( X = expand.grid(sensors,locations,directions) , MARGIN = 1, FUN = function(s) paste(s,collapse="_") ) )
RPYs <- sort( apply( X = expand.grid(angles,locations) , MARGIN = 1, FUN = function(s) paste(s,collapse="_") ) ) 
(inCols <- c("user_name", XYZs, RPYs, "classe"))
##  [1] "user_name"         "accel_arm_x"       "accel_arm_y"      
##  [4] "accel_arm_z"       "accel_belt_x"      "accel_belt_y"     
##  [7] "accel_belt_z"      "accel_dumbbell_x"  "accel_dumbbell_y" 
## [10] "accel_dumbbell_z"  "accel_forearm_x"   "accel_forearm_y"  
## [13] "accel_forearm_z"   "gyros_arm_x"       "gyros_arm_y"      
## [16] "gyros_arm_z"       "gyros_belt_x"      "gyros_belt_y"     
## [19] "gyros_belt_z"      "gyros_dumbbell_x"  "gyros_dumbbell_y" 
## [22] "gyros_dumbbell_z"  "gyros_forearm_x"   "gyros_forearm_y"  
## [25] "gyros_forearm_z"   "magnet_arm_x"      "magnet_arm_y"     
## [28] "magnet_arm_z"      "magnet_belt_x"     "magnet_belt_y"    
## [31] "magnet_belt_z"     "magnet_dumbbell_x" "magnet_dumbbell_y"
## [34] "magnet_dumbbell_z" "magnet_forearm_x"  "magnet_forearm_y" 
## [37] "magnet_forearm_z"  "pitch_arm"         "pitch_belt"       
## [40] "pitch_dumbbell"    "pitch_forearm"     "roll_arm"         
## [43] "roll_belt"         "roll_dumbbell"     "roll_forearm"     
## [46] "yaw_arm"           "yaw_belt"          "yaw_dumbbell"     
## [49] "yaw_forearm"       "classe"
inTraining<-as.data.frame(originalTraining[,inCols])
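
As a cross-check on the column count, the same names can be built without apply. This is only a sketch of an equivalent construction; the variable names xyzGrid, rpyGrid, xyzNames, and rpyNames are ours and do not appear elsewhere in the analysis.

```r
sensors    <- c("gyros", "accel", "magnet")
directions <- c("x", "y", "z")
angles     <- c("roll", "pitch", "yaw")
locations  <- c("belt", "arm", "dumbbell", "forearm")

# expand.grid enumerates every combination; do.call(paste, ...) joins each row with "_"
xyzGrid  <- expand.grid(sensor = sensors, location = locations,
                        direction = directions, stringsAsFactors = FALSE)
xyzNames <- sort(do.call(paste, c(xyzGrid, sep = "_")))

rpyGrid  <- expand.grid(angle = angles, location = locations,
                        stringsAsFactors = FALSE)
rpyNames <- sort(do.call(paste, c(rpyGrid, sep = "_")))

# 3 sensors x 4 locations x 3 directions, and 3 angles x 4 locations
print(length(xyzNames))
print(length(rpyNames))

# Sensor predictors, plus user_name and classe, give the full column list
print(length(c("user_name", xyzNames, rpyNames, "classe")))
```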

We now have a data set with 50 columns, which we split into training, validation, and testing partitions of 60%, 20%, and 20% respectively using caret.

set.seed(12345)
#60% of the rows go to the training partition
indexTrain <- createDataPartition(y = inTraining$classe, p = 0.6, list = FALSE)
trainingSet <- inTraining[indexTrain, ]
restT <- inTraining[-indexTrain, ]
#split the remaining 40% evenly into validation and testing partitions
indexV <- createDataPartition(y = restT$classe, p = 0.5, list = FALSE)
validationSet <- restT[indexV, ]
testingSet <- restT[-indexV, ]
dim(trainingSet); dim(validationSet); dim(testingSet)
## [1] 11776    50
## [1] 3923   50
## [1] 3923   50
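
For readers without caret at hand, the idea behind the stratified split can be sketched in base R. The helper stratifiedIndex and the toy outcome below are hypothetical, and the sketch only approximates what createDataPartition does for a factor outcome (sampling within each class).

```r
set.seed(12345)

# Toy outcome with unequal class sizes (hypothetical data)
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 40, 35, 30, 45)))

# Hypothetical helper: pick a fraction p of the positions within each class
stratifiedIndex <- function(y, p) {
  idxByClass <- split(seq_along(y), y)
  picked <- lapply(idxByClass, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# 60% training, then the remainder split in half
trainIdx <- stratifiedIndex(classe, 0.6)
restIdx  <- setdiff(seq_along(classe), trainIdx)
valPos   <- stratifiedIndex(classe[restIdx], 0.5)  # positions within restIdx
valIdx   <- restIdx[valPos]
testIdx  <- restIdx[-valPos]

# Class proportions should be similar across the three partitions
props <- sapply(list(train = trainIdx, validation = valIdx, test = testIdx),
                function(idx) prop.table(table(classe[idx])))
print(round(props, 2))
```

Sampling within each class keeps the class mix of every partition close to that of the full data set, which matters when class sizes are unequal.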

Training and Predicting

We proceed to fit a Random Forest model via the caret package and use the validation data set to see how well it performs:

set.seed(122333)
fitRF <- train(classe~.,method="rf",data=trainingSet)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: class
predictValRF <- predict(fitRF,validationSet)
confusionMatrix(predictValRF,validationSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    4    0    0    0
##          B    0  752    1    1    1
##          C    0    3  683   11    1
##          D    0    0    0  631    2
##          E    0    0    0    0  717
## 
## Overall Statistics
##                                         
##                Accuracy : 0.994         
##                  95% CI : (0.991, 0.996)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.992         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    0.991    0.999    0.981    0.994
## Specificity             0.999    0.999    0.995    0.999    1.000
## Pos Pred Value          0.996    0.996    0.979    0.997    1.000
## Neg Pred Value          1.000    0.998    1.000    0.996    0.999
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.284    0.192    0.174    0.161    0.183
## Detection Prevalence    0.285    0.192    0.178    0.161    0.183
## Balanced Accuracy       0.999    0.995    0.997    0.990    0.997
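
The headline numbers can be recomputed by hand from the confusion matrix above. The sketch below re-enters the counts from the output; binom.test is used for an exact binomial interval, which we assume is comparable to the interval caret reports.

```r
# Validation-set confusion matrix for the Random Forest, copied from the output above
# (rows = predictions, columns = reference)
cm <- matrix(
  c(1116,   4,   0,   0,   0,
       0, 752,   1,   1,   1,
       0,   3, 683,  11,   1,
       0,   0,   0, 631,   2,
       0,   0,   0,   0, 717),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: correctly classified cases (diagonal) over all cases
accuracy  <- sum(diag(cm)) / sum(cm)   # 3899 / 3923, about 0.994
errorRate <- 1 - accuracy

# Per-class sensitivity: correct predictions over the reference count of each class
sensitivity <- diag(cm) / colSums(cm)

# Exact binomial confidence interval for the accuracy
ci <- binom.test(sum(diag(cm)), sum(cm))$conf.int

print(round(accuracy, 3))
print(round(errorRate, 3))
print(round(sensitivity, 3))
print(round(as.numeric(ci), 3))
```

The same arithmetic applies to every confusion matrix in this report; only the counts change.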

We achieved a prediction accuracy of 99.4% on the validation set. We expect the out-of-sample error on the testing set to be close to what we achieved on the validation set:

predictRF <- predict(fitRF,testingSet)
confusionMatrix(predictRF,testingSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1112    6    0    0    0
##          B    4  751    6    0    0
##          C    0    2  675    6    2
##          D    0    0    3  634    2
##          E    0    0    0    3  717
## 
## Overall Statistics
##                                         
##                Accuracy : 0.991         
##                  95% CI : (0.988, 0.994)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.989         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.996    0.989    0.987    0.986    0.994
## Specificity             0.998    0.997    0.997    0.998    0.999
## Pos Pred Value          0.995    0.987    0.985    0.992    0.996
## Neg Pred Value          0.999    0.997    0.997    0.997    0.999
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.283    0.191    0.172    0.162    0.183
## Detection Prevalence    0.285    0.194    0.175    0.163    0.184
## Balanced Accuracy       0.997    0.993    0.992    0.992    0.997

Performance on the out-of-sample testing set is slightly below that on the validation set, with an accuracy of 99.1%. We could strive for something closer to 100%, but at the risk of over-fitting. Nevertheless, for comparison, we proceed to apply a Support Vector Machine (SVM) with a radial basis kernel and a Generalized Boosted Regression Model (GBM).

Try Support Vector Machine:

set.seed(222333)
fitSVM <- train(classe~.,method="svmRadial",data=trainingSet)
predictValSVM <- predict(fitSVM,validationSet)
confusionMatrix(predictValSVM,validationSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1106   64    6    3    1
##          B    3  653   31    6    9
##          C    4   36  633   79   39
##          D    2    1   14  552   22
##          E    1    5    0    3  650
## 
## Overall Statistics
##                                         
##                Accuracy : 0.916         
##                  95% CI : (0.907, 0.925)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.894         
##  Mcnemar's Test P-Value : <2e-16        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.991    0.860    0.925    0.858    0.902
## Specificity             0.974    0.985    0.951    0.988    0.997
## Pos Pred Value          0.937    0.930    0.800    0.934    0.986
## Neg Pred Value          0.996    0.967    0.984    0.973    0.978
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.282    0.166    0.161    0.141    0.166
## Detection Prevalence    0.301    0.179    0.202    0.151    0.168
## Balanced Accuracy       0.982    0.922    0.938    0.923    0.949
predictSVM <- predict(fitSVM,testingSet)
confusionMatrix(predictSVM,testingSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1103   69    2    3    2
##          B    5  658   43    7    9
##          C    5   30  616   69   18
##          D    1    2   23  561   19
##          E    2    0    0    3  673
## 
## Overall Statistics
##                                         
##                Accuracy : 0.92          
##                  95% CI : (0.912, 0.929)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.899         
##  Mcnemar's Test P-Value : <2e-16        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.988    0.867    0.901    0.872    0.933
## Specificity             0.973    0.980    0.962    0.986    0.998
## Pos Pred Value          0.936    0.911    0.835    0.926    0.993
## Neg Pred Value          0.995    0.968    0.979    0.975    0.985
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.281    0.168    0.157    0.143    0.172
## Detection Prevalence    0.301    0.184    0.188    0.154    0.173
## Balanced Accuracy       0.981    0.923    0.931    0.929    0.966

SVM produced an accuracy of 91.6% on the validation set and 92.0% on the testing set, underperforming Random Forest.

We now try GBM:

set.seed(322333)
fitGBM <- train(classe~.,method="gbm", data=trainingSet, verbose = FALSE)
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
predictValGBM <- predict(fitGBM,validationSet)
confusionMatrix(predictValGBM,validationSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1102   20    0    2    0
##          B    9  710   11    1    9
##          C    4   23  667   25    7
##          D    1    5    5  611   10
##          E    0    1    1    4  695
## 
## Overall Statistics
##                                        
##                Accuracy : 0.965        
##                  95% CI : (0.959, 0.97)
##     No Information Rate : 0.284        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.955        
##  Mcnemar's Test P-Value : NA           
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.987    0.935    0.975    0.950    0.964
## Specificity             0.992    0.991    0.982    0.994    0.998
## Pos Pred Value          0.980    0.959    0.919    0.967    0.991
## Neg Pred Value          0.995    0.985    0.995    0.990    0.992
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.281    0.181    0.170    0.156    0.177
## Detection Prevalence    0.287    0.189    0.185    0.161    0.179
## Balanced Accuracy       0.990    0.963    0.978    0.972    0.981
predictGBM <- predict(fitGBM,testingSet)
confusionMatrix(predictGBM,testingSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1096   27    0    1    0
##          B   14  713   23    4   13
##          C    4   18  652   21    5
##          D    2    1    9  610   13
##          E    0    0    0    7  690
## 
## Overall Statistics
##                                         
##                Accuracy : 0.959         
##                  95% CI : (0.952, 0.965)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.948         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.982    0.939    0.953    0.949    0.957
## Specificity             0.990    0.983    0.985    0.992    0.998
## Pos Pred Value          0.975    0.930    0.931    0.961    0.990
## Neg Pred Value          0.993    0.985    0.990    0.990    0.990
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.279    0.182    0.166    0.155    0.176
## Detection Prevalence    0.287    0.196    0.178    0.162    0.178
## Balanced Accuracy       0.986    0.961    0.969    0.971    0.977

GBM produced an accuracy of 96.5% on the validation set and 95.9% on the testing set.
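
The accuracies reported so far can be gathered into one small table for comparison. The figures are copied from the confusionMatrix outputs above; the data frame and its column names are just a convenience for this sketch.

```r
# Accuracies copied from the confusionMatrix outputs for each model
results <- data.frame(
  model      = c("Random Forest", "SVM (radial)", "GBM"),
  validation = c(0.994, 0.916, 0.965),
  testing    = c(0.991, 0.920, 0.959)
)

# Out-of-sample error estimate on the held-out testing partition
results$testError <- 1 - results$testing

# Rank the models from best to worst testing accuracy
ranked <- results[order(-results$testing), ]
print(ranked, row.names = FALSE)
```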

Random Forest outperformed both the GBM and SVM models for the specific set of predictors we chose. Their relative performance might differ with a different set of predictors.

As such, to predict the outcome (classe) in the given testing set, we use the fitted Random Forest model.

inColsTest <- c("user_name", XYZs, RPYs)
inTesting<-as.data.frame(originalTesting[,inColsTest])
predictTestRF<-predict(fitRF,inTesting)
(answers<-as.character(predictTestRF))
##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A"
## [18] "B" "B" "B"

Finally, we will write out the twenty predictions to individual files:

#Write each prediction to its own file, problem_id_<i>.txt
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(answers)
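
A quick way to sanity-check the written files is to read them back. The sketch below does this in a temporary directory with a shortened, made-up answer vector, and it uses writeLines in place of write.table, so it is an illustration rather than the submission code.

```r
# Shortened hypothetical answer vector (the real one has 20 entries)
answers <- c("B", "A", "B")

# Write into a temporary directory so the working directory stays untouched
outDir <- file.path(tempdir(), "pml_answers")
dir.create(outDir, showWarnings = FALSE)
files <- file.path(outDir, paste0("problem_id_", seq_along(answers), ".txt"))
for (i in seq_along(answers)) {
  writeLines(answers[i], files[i])
}

# Read the first line of each file back and compare with what was written
readBack <- vapply(files, function(f) readLines(f, n = 1), character(1),
                   USE.NAMES = FALSE)
print(identical(readBack, answers))
```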

After the answers were submitted, the predictions on the provided out-of-sample test data set proved to be 100% accurate.

Conclusion

We used sensor data collected from six human subjects lifting dumbbells to predict how well they performed the lift. Specifically, we chose 48 sensor measurements (combinations of sensors, directions, angles, and locations) plus the subject's name, for 49 predictors in total. Among the machine learning algorithms we applied, Random Forest outperformed the Support Vector Machine and the Generalized Boosted Regression Model. We could have tried a model ensemble to see whether predictive power could be increased. But with an accuracy of over 99% achieved by Random Forest, we reasoned that chasing additional accuracy might risk over-fitting and thus lose the power to generalize to other out-of-sample data sets.